Skip to content

UCP/CORE: Add rc_gda transport to alias list of ib#11170

Merged
yosefe merged 7 commits intoopenucx:masterfrom
guy-ealey-morag:rc-gda-aliases
Mar 4, 2026
Merged

UCP/CORE: Add rc_gda transport to alias list of ib#11170
yosefe merged 7 commits intoopenucx:masterfrom
guy-ealey-morag:rc-gda-aliases

Conversation

@guy-ealey-morag
Copy link
Contributor

@guy-ealey-morag guy-ealey-morag commented Feb 5, 2026

What?

Add rc_gda transport to aliases list of ib and cuda as it depends on both

Why?

The NIXL test UcxHardwareWarningTest.WarnWhenIbPresentButRdmaNotSupported failed because setting UCX_TLS=^ib did not disable InfiniBand completely because the rc_gda transport remained available, causing the warning not to show when expected.

A temporary fix for the test will be pushed in parallel to let the tests pass.
ai-dynamo/nixl#1292

Copy link
Contributor

@ColinNV ColinNV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes the NIXL test.

$ bin/gtest --gtest_filter="UcxHardwareWarningTest.*"
Note: Google Test filter = UcxHardwareWarningTest.*
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from UcxHardwareWarningTest
[ RUN      ] UcxHardwareWarningTest.WarnWhenGpuPresentButCudaNotSupported
W0205 18:58:18.931417  145837 ucx_utils.cpp:654] 1 NVIDIA GPU(s) were detected, but UCX CUDA support was not found! GPU memory is not supported.
[       OK ] UcxHardwareWarningTest.WarnWhenGpuPresentButCudaNotSupported (1490 ms)
[ RUN      ] UcxHardwareWarningTest.WarnWhenIbPresentButRdmaNotSupported
W0205 18:58:19.090172  145837 ucx_utils.cpp:662] 3 IB device(s) were detected, but accelerated IB support was not found! Performance may be degraded.
[       OK ] UcxHardwareWarningTest.WarnWhenIbPresentButRdmaNotSupported (155 ms)
[ RUN      ] UcxHardwareWarningTest.NoWarningWhenIbAndCudaSupported
[       OK ] UcxHardwareWarningTest.NoWarningWhenIbAndCudaSupported (154 ms)
[----------] 3 tests from UcxHardwareWarningTest (1799 ms total)
 
[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (1799 ms total)
[  PASSED  ] 3 tests.

{ "dc_x", { "dc_mlx5", UCP_TL_AUX("ud_mlx5"), NULL } },
{ "ugni", { "ugni_smsg", UCP_TL_AUX("ugni_udt"), "ugni_rdma", NULL } },
{ "cuda", { "cuda_copy", "cuda_ipc", "gdr_copy", NULL } },
{ "cuda", { "cuda_copy", "cuda_ipc", "gdr_copy", "rc_gda", NULL } },
Copy link
Contributor

@brminich brminich Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure we also want it in cuda alias. E.g. if I want to use UCX_TLS=rc,cuda it will automatically enable rc_gda (which requires a lot of resources). Also not sure if we need it in rc alias.
@ofirfarjun7, @Artemy-Mellanox wdyt?

Copy link
Contributor

@ofirfarjun7 ofirfarjun7 Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see it in rc alias what am I'm missing?
We need to disable it if user disables ib/cuda. Do we have other conventional way to do it beside the aliases?
Regarding resources, with latest implementation it create additional transport for each NIC (NIC-GPU) transport , but just one not all combos (@Artemy-Mellanox please correct me if I'm wrong).

{ "dc_x", { "dc_mlx5", UCP_TL_AUX("ud_mlx5"), NULL } },
{ "ugni", { "ugni_smsg", UCP_TL_AUX("ugni_udt"), "ugni_rdma", NULL } },
{ "cuda", { "cuda_copy", "cuda_ipc", "gdr_copy", NULL } },
{ "cuda", { "cuda_copy", "cuda_ipc", "gdr_copy", "rc_gda", NULL } },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it should be only for IB

@guy-ealey-morag
Copy link
Contributor Author

Do we agree that rc_gda should be added to the ib alias but not the cuda alias?
@brminich @ofirfarjun7 @yosefe

@guy-ealey-morag
Copy link
Contributor Author

Do we agree that rc_gda should be added to the ib alias but not the cuda alias? @brminich @ofirfarjun7 @yosefe

Removed from the cuda alias

@guy-ealey-morag guy-ealey-morag changed the title UCP/CORE: Add rc_gda transport to aliases list of ib and cuda UCP/CORE: Add rc_gda transport to alias list of ib Feb 20, 2026
tvegas1
tvegas1 previously approved these changes Feb 20, 2026
@yosefe yosefe enabled auto-merge (squash) February 21, 2026 15:44
ofirfarjun7
ofirfarjun7 previously approved these changes Feb 22, 2026
@guy-ealey-morag guy-ealey-morag dismissed stale reviews from ofirfarjun7 and tvegas1 via 2e67fc2 February 27, 2026 15:47
tvegas1
tvegas1 previously approved these changes Feb 27, 2026
yosefe
yosefe previously approved these changes Mar 3, 2026
tvegas1
tvegas1 previously approved these changes Mar 4, 2026
Copy link
Contributor

@tvegas1 tvegas1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment

@tvegas1
Copy link
Contributor

tvegas1 commented Mar 4, 2026

maybe we want to do the fd leak fix in separate PR? could it also be helpful in v1.20?

@guy-ealey-morag
Copy link
Contributor Author

guy-ealey-morag commented Mar 4, 2026

maybe we want to do the fd leak fix in separate PR? could it also be helpful in v1.20?

I decided with @yosefe to do this fix here because it's needed for the tests to pass.
The rest of the fd leaks are fixed in #11222

tvegas1
tvegas1 previously approved these changes Mar 4, 2026
@yosefe yosefe merged commit 00cac8e into openucx:master Mar 4, 2026
152 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants